The full modulation spectrum is a high-dimensional representation of one-dimensional audio signals. Most previous research in automatic speech recognition converted this very rich representation into the equivalent of a sequence of short-time power spectra, mainly to simplify the computation of the posterior probability that a frame of an unknown speech signal is related to a specific state. In this paper we use the raw output of a modulation spectrum analyser in combination with sparse coding as a means for obtaining state posterior probabilities. The modulation spectrum analyser uses 15 gammatone filters. The Hilbert envelope of the output of these filters is then processed by nine modulation frequency filters, with bandwidths up to 16 Hz. Experiments using the AURORA-2 task show that the novel approach is promising. We found that the representation of medium-term dynamics in the modulation spectrum analyser must be improved. We also found that we should move towards sparse classification, by modifying the cost function in sparse coding such that the class(es) represented by the exemplars weigh in, in addition to the accuracy with which unknown observations are reconstructed. This creates two challenges: (1) developing a method for dictionary learning that takes the class occupancy of exemplars into account and (2) developing a method for learning a mapping from exemplar activations to state posterior probabilities that keeps the generalization to unseen conditions that is one of the strongest advantages of sparse coding.
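To illustrate how exemplar activations can yield state posterior probabilities, the following is a minimal sketch and not the method used in the paper. It makes several assumptions: a fixed dictionary of non-negative exemplars, a Euclidean reconstruction cost with an L1 sparsity penalty solved by multiplicative updates, one state label per exemplar, and posteriors obtained by summing and normalising activations per state. The names `sparse_activations`, `state_posteriors`, `sparsity` and `exemplar_states` are hypothetical.

```python
def sparse_activations(x, exemplars, sparsity=0.1, iterations=200):
    """Approximate non-negative activations h that minimise
    0.5 * ||x - A h||^2 + sparsity * sum(h), where the rows of
    `exemplars` are the columns of the dictionary A.
    Assumption: x and all exemplars are non-negative vectors."""
    n = len(exemplars)
    h = [1.0] * n
    # Precompute A^T x and A^T A once; they do not change across iterations.
    atx = [sum(a_k * x_k for a_k, x_k in zip(a, x)) for a in exemplars]
    ata = [[sum(p * q for p, q in zip(a, b)) for b in exemplars]
           for a in exemplars]
    for _ in range(iterations):
        atah = [sum(ata[i][j] * h[j] for j in range(n)) for i in range(n)]
        # Multiplicative update keeps every activation non-negative.
        h = [h[i] * atx[i] / (atah[i] + sparsity + 1e-12) for i in range(n)]
    return h


def state_posteriors(h, exemplar_states, num_states):
    """Sum activations per state label and normalise to a distribution.
    Assumption: each exemplar carries exactly one state label."""
    totals = [0.0] * num_states
    for weight, state in zip(h, exemplar_states):
        totals[state] += weight
    norm = sum(totals)
    if norm == 0.0:
        # No exemplar was activated: fall back to a uniform distribution.
        return [1.0 / num_states] * num_states
    return [t / norm for t in totals]


# Toy usage: two orthogonal exemplars, each labelled with a different state.
exemplars = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
activations = sparse_activations([1.0, 0.05, 0.0], exemplars)
posteriors = state_posteriors(activations, [0, 1], num_states=2)
```

In this sketch the cost function only rewards reconstruction accuracy and sparsity; the class labels enter after the fact, in `state_posteriors`. The abstract's proposal to let exemplar classes weigh in on the cost function itself would change the first step, not the second.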